Compound Value Types in RDF
... like blank nodes, just not blank
Prelude
Most facts can be expressed as a single triple of the form
:subject :predicate :object .
for instance,
:Colorado :partOf :United_States .
On the other hand, some relationships fundamentally involve more than one thing, and some variables change rapidly as a function of time. In those cases we extend the triple model by adding an extra node to model the relationship. This node plays a role similar to a table row in a database, and in Freebase it is called either a Compound Value Type (CVT) or a Mediator.
In RDF it is possible to create a node without a unique identifier, known as a "blank node", but in a community database like Freebase, a global name ensures that anyone really can say anything about anything. This article will teach you how to work with Compound Value Types when querying Freebase data with SPARQL.
This post is a followup to How to write SPARQL queries against Freebase data and is part of a series. Subscribe to the RSS feed and to the :BaseKB mailing list for future episodes.
The data set I'm using is the 2014-03-02 edition of :BaseKB Gold. You can download this via BitTorrent and load it into any standards-compliant triple store, but it's even faster to use the pre-loaded Compact Edition, which can deploy perfectly matched hardware, software and data in just one click.
Dated measurements
The :measurement_unit base contains a number of "mediators" that represent measurements taken at a moment in time. :measurement_unit.dated_money_value is just a touch more complicated than average because it contains a currency unit.
To see how this works, I looked up the Eastman Kodak Corporation via the Freebase web UI and got the following id: :m.0ftdd. I can then look up revenue numbers like
prefix : <http://rdf.basekb.com/ns/>
select ?valid_date ?amount {
:m.0ftdd :business.business_operation.revenue ?that .
?that
:measurement_unit.dated_money_value.amount ?amount ;
:measurement_unit.dated_money_value.valid_date ?valid_date
} order by (?valid_date)
I get the results back as (date, amount) pairs, and piping them into R I see one of the great American rise-and-fall stories:
Note in this case the currency and source attributes are boring because the currency is always USD and the source attribute is missing.
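If you want to check the currency and source for yourself, a minimal sketch follows. It assumes the two attributes are named :measurement_unit.dated_money_value.currency and :measurement_unit.dated_money_value.source, by analogy with the amount and valid_date properties above. Both are wrapped in optional so that rows with a missing source are still returned.

```sparql
prefix : <http://rdf.basekb.com/ns/>
select ?valid_date ?amount ?currency ?source {
  :m.0ftdd :business.business_operation.revenue ?that .
  ?that
    :measurement_unit.dated_money_value.amount ?amount ;
    :measurement_unit.dated_money_value.valid_date ?valid_date .
  # Assumed property names; optional keeps rows where the attribute is absent
  optional { ?that :measurement_unit.dated_money_value.currency ?currency }
  optional { ?that :measurement_unit.dated_money_value.source ?source }
} order by ?valid_date
```

Unbound ?currency or ?source cells in the output correspond to attributes that were never asserted on that CVT node.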
Adding a dimension
Although it is common to look at compound value types that are associated with a topic, it is also easy with SPARQL to approach CVTs from side angles. For instance, this query turns up the top sources of dated integers:
prefix : <http://rdf.basekb.com/ns/>
select ?source (count(*) as ?cnt) {
?value :measurement_unit.dated_integer.source ?source .
} group by ?source order by desc(?cnt) limit 5
If you need to know, the text labels were put in by hand in OneNote
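The sources returned above are individual information-source nodes. One way to roll them up by the organization behind them is to follow :dataworld.information_source.provider, the predicate this article uses in its World Bank queries. This is only a sketch: it assumes each source of interest carries a provider link, and sources without one simply drop out of the count.

```sparql
prefix : <http://rdf.basekb.com/ns/>
select ?provider (count(*) as ?cnt) {
  ?value :measurement_unit.dated_integer.source ?source .
  # Sources lacking a provider statement are not counted
  ?source :dataworld.information_source.provider ?provider .
} group by ?provider order by desc(?cnt) limit 5
```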
The World Bank (known as :m.02vk52z) is a data provider to Freebase, and we can find the data types it provides like so:
prefix : <http://rdf.basekb.com/ns/>
select ?p (count(*) as ?cnt) {
?source :dataworld.information_source.provider :m.02vk52z .
?item ?p ?source .
} group by ?p order by desc(?cnt)
with the following results:
It is also possible to pull up a list of the predicates asserted by the World Bank:
prefix : <http://rdf.basekb.com/ns/>
select ?predicate (count(*) as ?cnt) {
?source :dataworld.information_source.provider :m.02vk52z .
?fact ?sourceP ?source .
?something ?predicate ?fact .
filter(regex(str(?sourceP),"measurement_unit"))
} group by ?predicate order by desc(?cnt)
How many CVT nodes are there?
We can count the subjects that appear in Freebase like so:
select (count(distinct(?s)) as ?count) {
?s ?p ?o .
}
We get a total count of 108,743,069, which greatly exceeds the 43,453,748 topics and the 7,962,738 schema objects; the latter are counted with the following query:
prefix : <http://rdf.basekb.com/ns/>
select (count(distinct(?s)) as ?count) {
?s a ?o .
FILTER(REGEX(STR(?o),'/type[.]'))
}
This leaves 57,326,583 non-topics (108,743,069 - 43,453,748 - 7,962,738). We can get some insight into these by looking up the top non-topic types, some of which are CVTs:
prefix : <http://rdf.basekb.com/ns/>
select ?nonTopicType (count(*) as ?count) {
?that a ?nonTopicType .
minus {?that a :common.topic }
} ORDER BY DESC(?count) LIMIT 50
Using the Summary data
Compound value types are marked like
?cvType :freebase.type_hints.mediator 1 .
If you try to join this against
?that a ?cvType .
you get poor results because ?cvType is a mid identifier, whereas the right-hand side of the a statement is a 'friendly' identifier like :film.performance. These don't match up, so the join doesn't work. Freebase doesn't contain statements like
:m.020w ?alsoKnownAs :measurement_unit.dated_integer .
but rather you find
:m.020w :freebase.object_hints.best_hrid "/measurement_unit/dated_integer" .
This join could be completed by the use of SPARQL functions to convert slashes into periods and strings into URIs, but there's an easier way.
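For the curious, here is a rough sketch of what that string-based join might look like, using the standard SPARQL 1.1 functions substr, replace, concat and iri. It assumes every best_hrid has the form "/domain/type" and maps onto the http://rdf.basekb.com/ns/ namespace by dropping the leading slash and turning the remaining slashes into periods. Expect it to be slow, since the store has to compute a URI for each mediator type before it can join.

```sparql
prefix : <http://rdf.basekb.com/ns/>
select ?typeUri (count(?that) as ?cnt) {
  ?cvType :freebase.type_hints.mediator 1 ;
          :freebase.object_hints.best_hrid ?hrid .
  # "/measurement_unit/dated_integer" -> :measurement_unit.dated_integer
  # substr is 1-based, so starting at 2 drops the leading slash
  bind(iri(concat("http://rdf.basekb.com/ns/",
                  replace(substr(str(?hrid), 2), "/", "."))) as ?typeUri)
  ?that a ?typeUri .
} group by ?typeUri order by desc(?cnt) limit 30
```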
We can get a list of the CVTs, however, by using the summary information in :freebase.type_profile.instance_count, which results in a simple query:
prefix : <http://rdf.basekb.com/ns/>
select ?name ?count {
?mediatorType :freebase.type_hints.mediator 1 ;
:freebase.type_profile.instance_count ?count ;
:freebase.object_hints.best_hrid ?name .
} order by desc(?count) LIMIT 30
and then by mixing up the query a little we get a count of all instances that are tagged as mediators:
prefix : <http://rdf.basekb.com/ns/>
select (sum(?count) AS ?sum) {
?mediatorType :freebase.type_hints.mediator 1 ;
:freebase.type_profile.instance_count ?count ;
:freebase.object_hints.best_hrid ?name .
}
which turns out to be 13,761,088, a tiny fraction, much less than 10%, of the relationships in the system.
People and roles
At this point you should know enough about how to look up CVT facts, even if it is a little awkward to format your results (that's the subject of a future post). I'm going to show a few more screenshots from the Freebase Web UI that illustrate some interesting uses of CVTs.
Employment histories are a good example of something that transcends the triple, since employment is (i) a temporary condition, and (ii) involves both an employer and a job title. If we look up, say, Steve Jobs, we get
If, on the other hand, you look at good Captain Kirk you see that Freebase knows he's been played by three different actors:
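The same query shape used for revenue applies to these CVTs. As an illustration of the employment case described above, the sketch below lists people who have worked at Eastman Kodak (:m.0ftdd, the id looked up earlier). The :business.employment_tenure.* property names are assumptions based on Freebase's naming conventions and should be checked against the schema before you rely on them.

```sparql
prefix : <http://rdf.basekb.com/ns/>
select ?person ?title ?from ?to {
  # Each ?tenure is a CVT node linking a person to an employer
  ?tenure :business.employment_tenure.company :m.0ftdd ;
          :business.employment_tenure.person ?person .
  # Title and dates are often missing, so keep them optional
  optional { ?tenure :business.employment_tenure.title ?title }
  optional { ?tenure :business.employment_tenure.from ?from }
  optional { ?tenure :business.employment_tenure.to ?to }
} order by ?from
```

Starting from the CVT node rather than from the person is the "side angle" approach again: one pattern answers the question for every employee at once.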
Conclusion
Compound Value Types are a useful extension that Freebase makes to the RDF data model; an understanding of CVTs is essential if you want to access all of the data that exists in Freebase.